1  Intro to Data Manipulation with dplyr

“I’ve missed more than 9000 shots in my career. I’ve lost almost 300 games. 26 times, I’ve been trusted to take the game winning shot and missed. I’ve failed over and over and over again in my life. And that is why I succeed.” - Michael Jordan

Michael Jordan hits the game winner over Craig Ehlo during Game 5 of a 1989 playoff series.

In sports analytics, one of the first steps in understanding and analyzing data is manipulation—transforming raw data into a form suitable for analysis. The ability to manipulate data efficiently is vital for uncovering insights in sports, such as analyzing player statistics, team performance, and game outcomes.

In this chapter, we introduce dplyr, a powerful R package designed for data manipulation. dplyr allows you to perform a range of common data wrangling tasks in a clear, concise, and readable way. Using dplyr functions, you can filter observations, select variables, compute summaries, and transform datasets. These operations will form the foundation for more advanced analyses in later chapters.

The dplyr package provides several key functions for manipulating data: - mutate() to add new variables or transform existing ones. - select() to choose specific columns of data. - filter() to subset the data based on conditions. - summarise() to aggregate data. - arrange() to reorder rows of data.

In this chapter, we will explore each of these functions in detail using sports data from the Lahman package, a commonly used dataset in sports analytics. The Lahman package includes historical Major League Baseball (MLB) data, such as player statistics, team performance, and other key metrics.

1.1 Using mutate() to Add or Transform Variables

The mutate() function in dplyr is an essential tool for creating new variables or modifying existing ones in a dataset. In the context of sports analytics, this function is particularly useful when you want to compute new metrics or adjust raw data into more meaningful forms. For example, a player’s raw performance statistics, such as hits or points scored, can be transformed into ratios or averages, providing more insightful performance measures such as batting averages or efficiency ratings.

1.1.1 Syntax

The basic syntax of mutate() is as follows:

mutate(data, new_variable = expression)

Here, data is the name of the dataset you are working with, new_variable is the name of the variable you are adding or modifying, and expression is the formula used to calculate the value of the new variable.

We can also use the pipe operator as follows:

data |> 
mutate(new_variable = expression)

1.1.2 Why Use mutate()?

In sports analytics, it is often necessary to create variables that represent metrics or statistics derived from existing data. For example:

  • Creating a new variable that represents a player’s scoring efficiency.
  • Calculating a team’s win rate by dividing the number of wins by the total number of games played.
  • Computing advanced metrics such as player efficiency rating (PER) in basketball or on-base plus slugging (OPS) in baseball.

By transforming raw data into meaningful statistics, you can uncover new insights that are not immediately apparent in the original dataset.

Example 1.1 (Calculating Batting Average) Let’s begin with a basic example using data from the Lahman package, which contains Major League Baseball (MLB) statistics. We will calculate each player’s batting average (BA), a widely used performance metric that represents the ratio of a player’s hits (H) to at-bats (AB).

library(tidyverse)
library(Lahman)

# Example: Adding Batting Average (BA) to Batting data
batting_data = Batting |>
  mutate(BA = H / AB)

head(batting_data)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
  IBB HBP SH SF GIDP        BA
1  NA  NA NA NA    0 0.0000000
2  NA  NA NA NA    0 0.2711864
3  NA  NA NA NA    1 0.2919708
4  NA  NA NA NA    0 0.3308271
5  NA  NA NA NA    0 0.3250000
6  NA  NA NA NA    0 0.2244898

In this example, we use mutate() to create a new variable BA, which stores the batting average for each player. This new column is derived from dividing the number of hits (H) by the number of at-bats (AB).

1.1.3 Handling Missing Values with mutate()

In some cases, you may encounter missing or undefined values in your dataset. When using mutate() to create new variables, it’s important to handle these missing values to avoid errors or incorrect calculations. In R, you can use the na.rm = TRUE argument in relevant functions (like sum() or mean()) to remove missing values before performing calculations.

Example 1.2 (Handling division by zero) Let’s modify the batting average calculation to handle cases where the number of at-bats (AB) is zero, which would otherwise lead to a division by zero error.

batting_data = Batting |>
  mutate(BA = ifelse(AB == 0, NA, H / AB))

head(batting_data)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
  IBB HBP SH SF GIDP        BA
1  NA  NA NA NA    0 0.0000000
2  NA  NA NA NA    0 0.2711864
3  NA  NA NA NA    1 0.2919708
4  NA  NA NA NA    0 0.3308271
5  NA  NA NA NA    0 0.3250000
6  NA  NA NA NA    0 0.2244898

In this case, we use ifelse() to check if a player has zero at-bats (AB == 0). If this condition is met, we assign a missing value (NA) to the batting average, otherwise, the batting average is calculated as usual.

1.1.4 Multiple Transformations with mutate()

The mutate() function can be used to perform multiple transformations in a single call. This is useful when you need to create several new variables based on existing data. You can also overwrite existing variables if needed.

Example 1.3 (Performing multiple transformations with mutate())  

batting_data = Batting |>
  mutate(BA = ifelse(AB == 0, NA, H / AB),
         OBP = (H + BB + HBP) / (AB + BB + SF + HBP),
         SLG = (H - X2B - X3B - HR + 2*X2B + 3*X3B + 4*HR) / AB, 
         OPS = OBP + SLG
         )

In this example, we calculate several new statistics—batting average (BA), on-base percentage (OBP), slugging percentage (SLG), and On-base Plus Slugging (OPS)—all within the same mutate() call.

The mutate() function is a versatile and powerful tool for creating new variables and transforming existing ones. In sports analytics, this ability is invaluable when calculating advanced statistics that provide deeper insights into player and team performance. Whether you’re calculating simple metrics like batting averages or more complex statistics like on-base plus slugging, mutate() allows you to efficiently manipulate and analyze your data. In the next sections, we will explore additional dplyr functions that complement mutate(), such as select(), filter(), summarise(), and arrange().

1.2 Using select() to Choose Variables

The select() function in dplyr is a powerful tool for selecting specific columns (or variables) from a dataset. In sports analytics, datasets can contain a large number of variables, many of which may not be immediately relevant to your analysis. By using select(), you can streamline your workflow, focusing on only the columns that are essential for the task at hand.

The function allows you to extract specific columns based on their names, making the data more manageable and readable, especially when you’re working with large datasets like those from the NBA, MLB, or other sports leagues.

1.2.1 Syntax

The basic syntax of select() is as follows:

data |>
  select(column1, column2, ...)

Here, data is the name of the dataset, and column1, column2, etc., represent the columns you wish to select from the dataset.

1.2.2 Why Use select()?

In sports analytics, it is common to work with large datasets that contain numerous variables, such as player names, performance metrics, team information, game statistics, and more. Often, you only need a few of these columns for a specific analysis. By using select(), you can focus on the relevant variables, simplifying your data and improving performance when working with large datasets.

For example:

  • Selecting player performance statistics such as points, assists, and rebounds.
  • Narrowing down a dataset to include only specific game information, such as game date, team, and score.
  • Choosing variables that represent team-level statistics over a season, like win totals or offensive efficiency.

Example 1.4 (Selecting Key Batting Statistics) Let’s use the Lahman dataset again, which contains a wealth of baseball statistics. Suppose we are only interested in a few key batting statistics, such as player ID (playerID), year (yearID), at-bats (AB), hits (H), and home runs (HR). We can use select() to extract these columns:

library(dplyr)
library(Lahman)

batting_data_selected = Batting |>
  select(playerID, yearID, AB, H, HR)

head(batting_data_selected)
   playerID yearID  AB  H HR
1 abercda01   1871   4  0  0
2  addybo01   1871 118 32  0
3 allisar01   1871 137 40  0
4 allisdo01   1871 133 44  2
5 ansonca01   1871 120 39  0
6 armstbo01   1871  49 11  0

In this example, the select() function filters the dataset down to just the columns playerID, yearID, AB, H, and HR. This is especially useful when you’re conducting specific analyses on player performance metrics like at-bats (AB), hits (H), and home runs (HR).

1.2.3 Selecting Columns by Range

You can also select columns by specifying a range using the colon (:) operator. This is helpful when dealing with consecutive columns that you wish to include in your analysis.

Example 1.5 (Selecting a range of columns) Let’s say you want to select all the columns between AB and HR in the batting dataset.

batting_data_range = Batting |>
  select(AB:HR)

head(batting_data_range)
   AB  R  H X2B X3B HR
1   4  0  0   0   0  0
2 118 30 32   6   0  0
3 137 28 40   4   5  0
4 133 28 44  10   2  2
5 120 29 39  11   3  0
6  49  9 11   2   1  0

In this example, the select(AB:HR) syntax selects all columns from AB to HR inclusively. This feature is handy when you want to work with a block of consecutive columns.

1.2.4 Dropping Columns

In some cases, you might want to exclude certain columns from your dataset rather than select specific ones. You can do this using the - sign in combination with select() to drop unwanted columns.

Example 1.6 (Dropping a specific column) Let’s say you want to keep all columns except for the player’s ID (playerID).

batting_data_dropped = Batting |>
  select(-playerID)

head(batting_data_dropped)
  yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO IBB HBP SH
1   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0  NA  NA NA
2   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0  NA  NA NA
3   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5  NA  NA NA
4   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2  NA  NA NA
5   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA
6   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1  NA  NA NA
  SF GIDP
1 NA    0
2 NA    0
3 NA    1
4 NA    0
5 NA    0
6 NA    0

Here, select(-playerID) removes the playerID column from the dataset, returning all other columns. This is useful when you want to retain most of the dataset but exclude a few irrelevant variables.

1.2.5 Using Helper Functions with select()

dplyr also provides helper functions that allow you to select columns based on certain patterns or criteria. These include:

  • starts_with(): Select columns whose names start with a certain prefix.
  • ends_with(): Select columns whose names end with a certain suffix.
  • contains(): Select columns whose names contain a certain string.
  • matches(): Select columns that match a regular expression.

These functions are particularly useful when working with datasets where columns follow a naming convention, such as stats that begin with “player_” or end with “_rate.”

Example 1.7 (Selecting Columns that Start with a Prefix) Suppose you are working with a basketball dataset and want to select all statistics related to shooting that start with field_goals. In this example, we will use the hoopR package in R. This package provide functions for fetching NBA and men’s college basketball data.

library(hoopR)

#load the data for 2023
basketball_data = load_nba_player_box(seasons = 2023)


shooting_stats = basketball_data |>
  select(starts_with("field_goals"))

head(shooting_stats)
# A tibble: 6 × 2
  field_goals_made field_goals_attempted
             <int>                 <int>
1                1                     4
2                5                    18
3                9                    20
4                3                    13
5                5                    12
6                2                     6

In this example, select(starts_with("field_goals")) extracts all columns whose names begin with “field_goals,” such as field_goals_made, and field_goals_attempted.

The select() function is a highly flexible tool for extracting relevant columns from large datasets, making it easier to work with and analyze the data that matters most. Whether you’re working with baseball, basketball, or any other sport, select() can streamline your workflow, helping you focus on the specific variables that are critical to your analysis. By combining select() with helper functions, such as starts_with(), ends_with(), and others, you can perform more advanced selections based on patterns or column names, further enhancing your ability to manipulate sports data.

In the next section, we’ll explore how to subset your data using filter() to focus on specific cases or observations based on conditions, such as selecting players above a certain performance threshold.

1.3 Using filter() to Subset the Data

The filter() function in dplyr is used to subset rows in a dataset based on specified conditions. This function allows you to extract observations that meet certain criteria, which is crucial in sports analytics when you need to focus on specific players, teams, or events. Whether you’re interested in filtering by player performance, team rankings, or game outcomes, filter() enables you to work with just the relevant data for your analysis.

1.3.1 Syntax

The basic syntax of filter() is as follows:

data |> 
  filter(condition1, condition2, ...)

Here, data is the dataset, and condition1, condition2, etc., are the conditions that the rows must meet to be included in the subset. Conditions typically involve logical comparisons, such as >, <, ==, !=, and %in%.

1.3.2 Why Use filter()?

In sports analytics, datasets often contain a wealth of information about many players, games, or seasons. However, your analysis might focus on a specific subset of this data, such as:

  • Filtering players with more than a certain number of points or minutes played.
  • Focusing on teams with a winning percentage above a specific threshold.
  • Subsetting games where a team scored over 100 points.
  • Isolating data from a particular season, team, or player.

By using filter(), you can isolate the rows that are most relevant to your analysis, making it easier to conduct meaningful evaluations or comparisons.

Example 1.8 (Filtering Players with More than 200 At-Bats) Let’s return to the Lahman dataset and use filter() to subset players who had more than 200 at-bats in a given season. This kind of filtering is useful when analyzing only those players who had significant playing time.

library(dplyr)
library(Lahman)

batting_filtered = Batting |>
  filter(AB > 200)

head(batting_filtered)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 ansonca01   1872     1    PH1   NA 46 217 60 90  10   7  0  48  6  6 16  3
2 barnero01   1872     1    BS1   NA 45 230 81 99  28   2  1  44 12  2  9  4
3 bechtge01   1872     1    NY2   NA 51 247 61 74  11   3  0  42  9  1  7  3
4 cummica01   1872     1    NY2   NA 55 249 37 52   9   3  0  26  0  1  4 14
5 cuthbne01   1872     1    PH1   NA 47 260 83 88  10   0  1  47 14  4  6 10
6 eggleda01   1872     1    NY2   NA 56 290 94 97  20   0  0  19 18  6  8  9
  IBB HBP SH SF GIDP
1  NA  NA NA NA    2
2  NA  NA NA NA    3
3  NA  NA NA NA    2
4  NA  NA NA NA    2
5  NA  NA NA NA    2
6  NA  NA NA NA    0

In this example, we use filter(AB > 200) to keep only the rows where players had more than 200 at-bats (AB). This ensures that our analysis focuses on players with sufficient playing time, avoiding the inclusion of those who had limited opportunities.

1.3.3 Filtering on Multiple Conditions

You can use filter() to apply multiple conditions at once, allowing for more precise subsetting. Multiple conditions can be combined with logical operators such as & (AND) and | (OR).

Example 1.9 (Filtering Players with More than 200 At-Bats and More Than 20 Home Runs)  

batting_filtered_multiple = Batting |>
  filter(AB > 200 & HR > 20)

head(batting_filtered_multiple)
   playerID yearID stint teamID lgID   G  AB   R   H X2B X3B HR RBI SB CS BB SO
1 ansonca01   1884     1    CHN   NL 112 475 108 159  30   3 21 102 NA NA 29 13
2 dalryab01   1884     1    CHN   NL 111 521 111 161  18   9 22  69 NA NA 14 39
3 pfefffr01   1884     1    CHN   NL 112 467 105 135  10  10 25 101 NA NA 25 47
4 willine01   1884     1    CHN   NL 107 417  84 116  18   8 27  84 NA NA 42 56
5 freembu01   1899     1    WAS   NL 155 588 107 187  19  25 25 122 21 NA 23 25
6 schulfr01   1911     1    CHN   NL 154 577 105 173  30  21 21 107 23 NA 76 71
  IBB HBP SH SF GIDP
1  NA  NA NA NA   NA
2  NA  NA NA NA   NA
3  NA  NA NA NA   NA
4  NA  NA NA NA   NA
5  NA  18  5 NA   NA
6  NA   3 31 NA   NA

Here, we use filter(AB > 200 & HR > 20) to select players who had more than 200 at-bats and also hit more than 20 home runs in a season. This type of filtering is useful for identifying high-performing players who both played regularly and demonstrated power-hitting ability.

1.3.4 Filtering with Logical Operators

You can apply more complex logical conditions with filter(). For example, if you wanted to find players who either had more than 200 at-bats or hit more than 20 home runs (but not necessarily both), you could use the OR operator |.

Example 1.10 (Using OR condition)  

batting_filtered_or = Batting |>
  filter(AB > 200 | HR > 20)

head(batting_filtered_or)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 ansonca01   1872     1    PH1   NA 46 217 60 90  10   7  0  48  6  6 16  3
2 barnero01   1872     1    BS1   NA 45 230 81 99  28   2  1  44 12  2  9  4
3 bechtge01   1872     1    NY2   NA 51 247 61 74  11   3  0  42  9  1  7  3
4 cummica01   1872     1    NY2   NA 55 249 37 52   9   3  0  26  0  1  4 14
5 cuthbne01   1872     1    PH1   NA 47 260 83 88  10   0  1  47 14  4  6 10
6 eggleda01   1872     1    NY2   NA 56 290 94 97  20   0  0  19 18  6  8  9
  IBB HBP SH SF GIDP
1  NA  NA NA NA    2
2  NA  NA NA NA    3
3  NA  NA NA NA    2
4  NA  NA NA NA    2
5  NA  NA NA NA    2
6  NA  NA NA NA    0

In this example, players are included in the subset if they meet either of the conditions: more than 200 at-bats or more than 20 home runs.

1.3.5 Using %in% to Filter by a Set of Values

The %in% operator can be used within filter() to check whether a column’s values belong to a specific set. This is particularly useful when you want to filter by specific categories, such as player IDs, teams, or years.

Example 1.11 (Filtering Data for Specific Years) Let’s say you want to analyze data from only the 2010 and 2015 seasons.

batting_filtered_years = Batting |>
  filter(yearID %in% c(2010, 2015))

head(batting_filtered_years)
   playerID yearID stint teamID lgID   G  AB  R   H X2B X3B HR RBI SB CS BB  SO
1 aardsda01   2010     1    SEA   AL  53   0  0   0   0   0  0   0  0  0  0   0
2  abadfe01   2010     1    HOU   NL  22   1  0   0   0   0  0   0  0  0  0   1
3 abreubo01   2010     1    LAA   AL 154 573 88 146  41   1 20  78 24 10 87 132
4 abreuto01   2010     1    ARI   NL  81 193 16  45  11   1  1  13  2  1  4  47
5 accarje01   2010     1    TOR   AL   5   0  0   0   0   0  0   0  0  0  0   0
6 aceveal01   2010     1    NYA   AL  10   0  0   0   0   0  0   0  0  0  0   0
  IBB HBP SH SF GIDP
1   0   0  0  0    0
2   0   0  0  0    0
3   3   2  0  5   13
4   0   0  0  4    8
5   0   0  0  0    0
6   0   0  0  0    0

In this example, we use filter(yearID %in% c(2010, 2015)) to retain only the rows where the yearID column matches 2010 or 2015. This approach is helpful when working with data from specific seasons or games.

1.3.6 Filtering Missing Data

In many datasets, missing values (represented as NA in R) can cause problems if not handled properly. You can use filter() to exclude rows with missing values or to isolate rows where certain values are missing.

Example 1.12 (Filtering Out Missing Data) Let’s filter out rows where the number of at-bats (AB) is missing.

# Removing rows with missing values for at-bats
batting_no_na = Batting |>
  filter(!is.na(AB))

head(batting_no_na)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 abercda01   1871     1    TRO   NA  1   4  0  0   0   0  0   0  0  0  0  0
2  addybo01   1871     1    RC1   NA 25 118 30 32   6   0  0  13  8  1  4  0
3 allisar01   1871     1    CL1   NA 29 137 28 40   4   5  0  19  3  1  2  5
4 allisdo01   1871     1    WS3   NA 27 133 28 44  10   2  2  27  1  1  0  2
5 ansonca01   1871     1    RC1   NA 25 120 29 39  11   3  0  16  6  2  2  1
6 armstbo01   1871     1    FW1   NA 12  49  9 11   2   1  0   5  0  1  0  1
  IBB HBP SH SF GIDP
1  NA  NA NA NA    0
2  NA  NA NA NA    0
3  NA  NA NA NA    1
4  NA  NA NA NA    0
5  NA  NA NA NA    0
6  NA  NA NA NA    0

Here, we use filter(!is.na(AB)) to remove rows where the value for at-bats (AB) is missing (NA). This ensures that we are working only with complete data.

The filter() function is an indispensable tool for narrowing down large datasets based on specific criteria. Whether you’re filtering by player performance, season, or game outcomes, filter() allows you to extract just the rows that are most relevant to your analysis. In sports analytics, this enables you to focus on specific players, teams, or metrics that drive meaningful insights.

In the next section, we will explore how to use the summarise() function to aggregate data, such as calculating team-level statistics or summarizing player performance over a season or career.

1.4 Using summarise() to Aggregate Data

The summarise() function in dplyr is used to create summary statistics from your dataset by collapsing multiple rows into a single value. This function is essential in sports analytics, where you often need to aggregate data to calculate overall metrics, such as total points, average performance statistics, or team-level summaries. With summarise(), you can compute a wide variety of statistics such as sums, averages, minimums, and maximums, allowing for powerful aggregation of data.

1.4.1 Syntax

The basic syntax of summarise() is as follows:

data |> 
  summarise(new_summary = aggregation_function(column))

Where:

  • data is the dataset.
  • new_summary is the name of the new summary statistic.
  • aggregation_function(column) is the function applied to the column you want to summarize (e.g., mean(), sum(), min(), max()).

1.4.2 Why Use summarise()?

In sports analytics, it is often necessary to calculate overall performance metrics, such as:

  • Summing a player’s total points over multiple seasons.
  • Calculating a team’s average points per game.
  • Aggregating player data to find league-wide averages or totals.
  • Computing summary statistics like the highest or lowest scoring games in a season.

Using summarise(), you can efficiently compute these summary statistics and gain high-level insights from raw data.

1.4.3 Grouping Data Before Summarizing

When using summarise(), it’s common to group data by one or more variables before applying the summary function. This is achieved with the group_by() function, which allows you to summarize data across different categories, such as by player, team, or season.

For example:

  • Summarizing total points per player across multiple games.
  • Calculating the average performance metrics per team in a season.
  • Aggregating statistics by season to track trends over time.

Example 1.13 (Summarizing Total Home Runs per Year) Let’s use the Lahman dataset to calculate the total number of home runs hit by all players for each year. First, we will group the data by year and then use summarise() to compute the total home runs.

library(dplyr)
library(Lahman)

# Summarizing total home runs per year
home_runs_per_year <- Batting |>
  group_by(yearID) |>
  summarise(total_HR = sum(HR, na.rm = TRUE))

head(home_runs_per_year)
# A tibble: 6 × 2
  yearID total_HR
   <int>    <int>
1   1871       47
2   1872       37
3   1873       47
4   1874       40
5   1875       40
6   1876       40

In this example, we first use group_by(yearID) to group the data by year, and then summarise() is used to compute the total number of home runs (HR) hit in each year. The na.rm = TRUE argument ensures that missing values are ignored during the summation. This type of summary can help analyze trends in home run performance over time.

Example 1.14 (Calculating Average Batting Average per Year) You can also calculate averages using summarise(). Let’s calculate the average batting average (BA) for each year by grouping the data by yearID and then using the mean() function within summarise().

# Calculating average batting average per year
batting_average_per_year <- Batting |>
  group_by(yearID) |>
  summarise(avg_BA = mean(H / AB, na.rm = TRUE))

head(batting_average_per_year)
# A tibble: 6 × 2
  yearID avg_BA
   <int>  <dbl>
1   1871  0.254
2   1872  0.246
3   1873  0.252
4   1874  0.240
5   1875  0.218
6   1876  0.211

In this example, we compute the average batting average (avg_BA) for each year. The mean() function calculates the average of hits divided by at-bats, and na.rm = TRUE ensures that missing data is handled appropriately.

Multiple Summaries with summarise()

You can compute multiple summary statistics in a single summarise() call by simply listing additional summary expressions separated by commas.

Example 1.15 (Summarizing Both Total and Average Home Runs per Year)  

# Summarizing total and average home runs per year
home_run_stats <- Batting |>
  group_by(yearID) |>
  summarise(total_HR = sum(HR, na.rm = TRUE),
            avg_HR = mean(HR, na.rm = TRUE))

head(home_run_stats)
# A tibble: 6 × 3
  yearID total_HR avg_HR
   <int>    <int>  <dbl>
1   1871       47  0.409
2   1872       37  0.236
3   1873       47  0.376
4   1874       40  0.325
5   1875       40  0.184
6   1876       40  0.323

In this example, we calculate both the total home runs (total_HR) and the average home runs per player (avg_HR) for each year. This provides both a high-level view (total) and a per-player perspective (average) of home run performance.

1.4.4 Adding Summary Statistics to the Original Data

In some cases, you may want to add summary statistics back to your original dataset. This can be achieved by combining mutate() with summarise() to append calculated statistics to each row.

Example 1.16 (Adding Average Points per Player to Each Team)  

library(hoopR)

#load the data for 2023
basketball_data = load_nba_player_box(seasons = 2023)


nba_stats = basketball_data |>
  group_by(team_name) |> 
  summarize(avg_pts_team = mean(points, na.rm = TRUE)) |> 
  right_join(basketball_data, by = "team_name")

head(nba_stats)
# A tibble: 6 × 58
  team_name avg_pts_team   game_id season season_type game_date 
  <chr>            <dbl>     <int>  <int>       <int> <date>    
1 76ers             10.5 401545117   2023           3 2023-05-14
2 76ers             10.5 401545117   2023           3 2023-05-14
3 76ers             10.5 401545117   2023           3 2023-05-14
4 76ers             10.5 401545117   2023           3 2023-05-14
5 76ers             10.5 401545117   2023           3 2023-05-14
6 76ers             10.5 401545117   2023           3 2023-05-14
# ℹ 52 more variables: game_date_time <dttm>, athlete_id <int>,
#   athlete_display_name <chr>, team_id <int>, team_location <chr>,
#   team_short_display_name <chr>, minutes <dbl>, field_goals_made <int>,
#   field_goals_attempted <int>, three_point_field_goals_made <int>,
#   three_point_field_goals_attempted <int>, free_throws_made <int>,
#   free_throws_attempted <int>, offensive_rebounds <int>,
#   defensive_rebounds <int>, rebounds <int>, assists <int>, steals <int>, …

In this example, we first calculate the average points per team (avg_pts_team) and then use right_join() to merge this summary statistic back into the original dataset, so that each player’s row now contains their team’s average points.

The summarise() function is a critical tool for aggregating and summarizing data in sports analytics. Whether you’re calculating total points, average performance metrics, or team-level statistics, summarise() allows you to collapse large datasets into meaningful summary statistics. By combining summarise() with group_by(), you can gain insights into trends across seasons, teams, or players, providing a comprehensive view of the data.

In the next section, we’ll explore the arrange() function, which is used to order and rank your data, making it easier to identify top performers and trends.

1.5 Using arrange() to Order Rows

The arrange() function in dplyr is used to reorder the rows of a dataset based on the values of one or more variables. This function is particularly useful in sports analytics when you want to rank players or teams according to performance metrics, such as points scored, assists made, or wins achieved. By using arrange(), you can quickly identify top performers, sort data chronologically, or rank teams based on standings.

1.5.1 Syntax

The basic syntax of arrange() is as follows:

data |> 
  arrange(column1, column2, ...)

Where:

  • data is the dataset.
  • column1, column2, etc., are the columns by which the data should be ordered.

By default, arrange() sorts the rows in ascending order. To sort in descending order, you can wrap the column name in desc().

1.5.2 Why Use arrange()?

In sports analytics, ranking and sorting data is a common task. Whether you’re determining the highest scorers, ordering teams by win percentage, or ranking players by efficiency, arrange() allows you to easily reorder your data to facilitate comparisons and insights.

For example, you may want to:

  • Rank basketball players by points scored per game.
  • Sort baseball teams by total wins.
  • Rank athletes by efficiency ratings, from highest to lowest.

Example 1.17 (Arranging by Batting Average) Let’s use the Lahman dataset again and sort the players by their batting average (BA). First, we need to calculate the batting average using mutate() and then sort the dataset in descending order to identify the top performers.

library(dplyr)
library(Lahman)

batting_data <- Batting |>
  mutate(BA = H / AB) |>
  arrange(desc(BA))

head(batting_data)
   playerID yearID stint teamID lgID G AB R H X2B X3B HR RBI SB CS BB SO IBB
1  snowch01   1874     1    BR2   NA 1  1 0 1   0   0  0   0  0  0  0  0  NA
2 baldwki01   1884     2    CHU   UA 1  1 0 1   0   0  0  NA NA NA  0 NA  NA
3 mccafsp01   1889     1    CL6   AA 2  1 1 1   0   0  0   0  0 NA  1  0  NA
4 gumbebi01   1893     1    LS3   NL 1  1 0 1   1   0  0   2  0 NA  0  0  NA
5 oconnfr01   1893     1    PHI   NL 3  2 1 2   0   0  1   3  0 NA  0  0  NA
6 brownpe01   1894     2    BRO   NL 1  2 1 2   0   0  0   2  0 NA  1  0  NA
  HBP SH SF GIDP BA
1  NA NA NA    0  1
2  NA NA NA   NA  1
3   0 NA NA   NA  1
4   0 NA NA   NA  1
5   0 NA NA   NA  1
6   0  0 NA   NA  1

In this example, we first calculate the batting average (BA) using mutate() and then use arrange(desc(BA)) to sort the data in descending order of batting average. This allows us to see the players with the highest batting averages at the top of the dataset.

1.5.3 Arranging by Multiple Variables

You can also sort data by multiple columns. For example, if two players have the same batting average, you might want to break the tie by looking at their home run totals.

Example 1.18 (Arranging by Batting Average and Home Runs)  

batting_data_sorted <- Batting |>
  mutate(BA = H / AB) |>
  arrange(desc(BA), desc(HR))

head(batting_data_sorted)
   playerID yearID stint teamID lgID  G AB R H X2B X3B HR RBI SB CS BB SO IBB
1 oconnfr01   1893     1    PHI   NL  3  2 1 2   0   0  1   3  0 NA  0  0  NA
2 lefebbi01   1938     1    BOS   AL  1  1 1 1   0   0  1   1  0  0  0  0  NA
3 sleatlo01   1958     1    DET   AL  4  1 1 1   0   0  1   1  0  0  0  0   0
4 narumbu01   1963     1    BAL   AL  7  1 1 1   0   0  1   2  0  0  0  0   0
5 quirkja01   1984     2    CLE   AL  1  1 1 1   0   0  1   1  0  0  0  0   0
6 perrypa02   1988     2    CHN   NL 35  1 1 1   0   0  1   2  0  0  0  0   0
  HBP SH SF GIDP BA
1   0 NA NA   NA  1
2   0  0 NA   NA  1
3   0  0  0    0  1
4   0  0  0    0  1
5   0  0  0    0  1
6   0  0  0    0  1

Here, we first sort by batting average (BA) in descending order, and in the case of ties, we further arrange the rows by home runs (HR) in descending order. This type of multi-variable sorting is useful for resolving ties or ranking data based on multiple criteria.

1.5.4 Sorting Chronologically

In many sports datasets, you might want to sort data chronologically to observe trends over time. This can be done by arranging data by a date or year variable.

Example 1.19 (Sorting by Year) Let’s sort the ‘Lahman’ dataset by the yearID variable to view the data in chronological order.

batting_data_by_year <- Batting |>
  arrange(desc(yearID))

head(batting_data_by_year)
   playerID yearID stint teamID lgID  G  AB  R  H X2B X3B HR RBI SB CS BB SO
1 abbotco01   2022     1    WAS   NL 16   0  0  0   0   0  0   0  0  0  0  0
2 abramcj01   2022     1    SDN   NL 46 125 16 29   5   0  2  11  1  2  4 27
3 abramcj01   2022     2    WAS   NL 44 159 17 41   7   2  0  10  6  2  1 23
4 abreual01   2022     1    TEX   AL  7   0  0  0   0   0  0   0  0  0  0  0
5 abreual01   2022     2    KCA   AL  4   0  0  0   0   0  0   0  0  0  0  0
6 abreual01   2022     3    NYA   AL 22   0  0  0   0   0  0   0  0  0  0  0
  IBB HBP SH SF GIDP
1   0   0  0  0    0
2   0   6  2  2    4
3   0   3  0  0    1
4   0   0  0  0    0
5   0   0  0  0    0
6   0   0  0  0    0

This example sorts the data in ascending order by yearID, allowing us to see the dataset arranged chronologically by season. Sorting by year or date is especially useful when analyzing time trends in performance or observing how a player’s career evolves over time.

The arrange() function is a valuable tool in sports analytics for ordering and ranking data. Whether you’re sorting players by performance metrics, ranking teams based on wins, or organizing data chronologically, arrange() allows you to structure your data in a way that facilitates meaningful analysis. You can arrange data by one or more columns, and by using desc(), you can sort in descending order to prioritize top performers.

1.6 Test Your Comprehension

  1. What is the primary function of the mutate() function in dplyr?

  2. Which function in dplyr is used to select specific columns from a dataset?

  3. In the context of the mutate() function, what would happen if you attempt to divide a value by zero without handling it?

  4. Which of the following functions is used to reorder rows in a dataset?

  5. To extract rows of data that meet specific conditions, such as “players with more than 200 at-bats”, which function would you use?

  6. True or False: The select() function allows you to drop specific columns by prefixing their names with a minus sign (-).

  7. True or False: In the filter() function, using & combines conditions in such a way that both conditions must be true for a row to be included in the subset.

  8. True or False: The arrange() function in dplyr can only sort data in ascending order.